Understanding Transformer from the Perspective of Associative Memory

Zhong, Shu, Xu, Mingyu, Ao, Tenglong, Shi, Guang

arXiv.org Artificial Intelligence

In this paper, we share our reflections and insights on understanding Transformer architectures through the lens of associative memory -- a classic psychological concept inspired by human cognition. We start with the basics of associative memory (think simple linear attention) and then dive into two dimensions:

Memory Capacity: How much can a Transformer really remember, and how well? We introduce retrieval SNR to measure this and use a kernel perspective to mathematically reveal why Softmax Attention is so effective. We also show how FFNs can be seen as a type of associative memory, leading to insights on their design and potential improvements.

Memory Update: How do these memories learn and evolve? We present a unified framework for understanding how different Transformer variants (like DeltaNet and Softmax Attention) update their "knowledge base". This leads us to tackle two provocative questions: 1. Are Transformers fundamentally limited in what they can express, and can we break these barriers? 2. If a Transformer had infinite context, would it become infinitely intelligent?

We want to demystify Transformer architecture, offering a clearer understanding of existing designs. This exploration aims to provide fresh insights and spark new avenues for Transformer innovation.
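The associative-memory view of linear attention can be made concrete in a few lines of NumPy. This is an illustrative sketch of the general idea, not the paper's code: key-value pairs are written into a state matrix via outer products, and reading with a stored key returns the target value plus crosstalk from the other pairs, which motivates a retrieval signal-to-noise ratio. The dimensions and the particular SNR definition below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 8                                    # feature dim, number of stored pairs
K = rng.standard_normal((n, d)) / np.sqrt(d)    # keys, roughly unit norm
V = rng.standard_normal((n, d))                 # values

# Write phase: M = sum_i v_i k_i^T (the "state" of linear attention)
M = V.T @ K

# Read phase: query the memory with a stored key
q = K[0]
retrieved = M @ q                               # = sum_i v_i (k_i . q)

# Retrieval decomposes into the target value (signal) plus
# interference from the other stored pairs (noise).
signal = V[0] * (K[0] @ q)
noise = retrieved - signal
snr = np.linalg.norm(signal) / np.linalg.norm(noise)
print(f"retrieval SNR: {snr:.2f}")
```

As more pairs are packed into the fixed-size matrix M, the interference term grows and the SNR drops, which is the capacity question the abstract raises.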


8b5040a8a5baf3e0e67386c2e3a9b903-Reviews.html

Neural Information Processing Systems

Summary: This paper addresses the problem of conditional density estimation with a high-dimensional input space (p ≫ n), an important problem as most (if not all) current models for nonparametric conditional density estimation do not scale to high dimensions. Moreover, datasets with high-dimensional inputs but relatively small sample sizes are becoming increasingly common. The model for the conditional density f(y | x) is defined in three stages. First, a tree structure is defined over the input space. Second, given the tree structure, C_{j,k}, the kth partition of the X space at scale j, is mapped to a lower-dimensional space.


Estimating Density Models with Complex Truncation Boundaries

Liu, Song, Kanamori, Takafumi

arXiv.org Machine Learning

Truncated densities are probability density functions defined on truncated input domains. These densities share the same parametric form with their non-truncated counterparts up to a normalization term. However, normalization terms usually cannot be obtained in closed form for these distributions, due to complicated truncation domains. Score Matching is a powerful tool for fitting parameters in unnormalized models. However, it cannot be straightforwardly applied here, as the boundary conditions used to derive a tractable objective are usually not satisfied by truncated distributions. In this paper, we propose a maximally weighted Score Matching objective function which takes the geometry of the truncation boundary into account when fitting unnormalized density models. We show that the weighting function maximizing the objective function can be constructed easily and that the boundary conditions for deriving a tractable objective are satisfied.
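A hedged sketch of the general weighted score-matching idea, not the paper's maximally weighted construction: for a 1D Gaussian truncated to (0, ∞), choosing a weight g(x) = x that vanishes at the boundary lets integration by parts go through, yielding the tractable objective E[g(x)(0.5 s(x)² + s′(x)) + g′(x) s(x)] for the model score s. The weight choice and toy parameters below are assumptions.

```python
import numpy as np

def weighted_sm_loss(mu, sigma, x):
    """Weighted score-matching loss for a Gaussian model on truncated data x > 0."""
    s = -(x - mu) / sigma**2        # score of N(mu, sigma^2)
    ds = -1.0 / sigma**2            # derivative of the score
    g, dg = x, 1.0                  # weight g(x) = x vanishes at the boundary 0
    return np.mean(g * (0.5 * s**2 + ds) + dg * s)

# Rejection-sample a N(2, 1) truncated to (0, inf)
rng = np.random.default_rng(0)
z = rng.normal(2.0, 1.0, size=20000)
x = z[z > 0][:5000]

# The loss is smaller at the true mean than at a badly misspecified one,
# even though the truncated density's normalizer was never computed.
loss_true = weighted_sm_loss(2.0, 1.0, x)
loss_bad = weighted_sm_loss(5.0, 1.0, x)
print(loss_true, loss_bad)
```

The point of the sketch is that the objective is evaluated from samples and the unnormalized model alone; the intractable normalization term of the truncated density never appears.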


Structure Learning of Partitioned Markov Networks

Liu, Song, Suzuki, Taiji, Sugiyama, Masashi, Fukumizu, Kenji

arXiv.org Machine Learning

We learn the structure of a Markov Network between two groups of random variables from joint observations. Since modelling and learning the full MN structure may be hard, learning the links between two groups directly may be a preferable option. We introduce a novel concept called the \emph{partitioned ratio} whose factorization directly associates with the Markovian properties of random variables across two groups. A simple one-shot convex optimization procedure is proposed for learning the \emph{sparse} factorizations of the partitioned ratio and it is theoretically guaranteed to recover the correct inter-group structure under mild conditions. The performance of the proposed method is experimentally compared with the state of the art MN structure learning methods using ROC curves. Real applications on analyzing bipartisanship in US congress and pairwise DNA/time-series alignments are also reported.


Support Consistency of Direct Sparse-Change Learning in Markov Networks

Liu, Song (Tokyo Institute of Technology, Japan) | Suzuki, Taiji (Tokyo Institute of Technology, Japan) | Sugiyama, Masashi (University of Tokyo, Japan)

AAAI Conferences

We study the problem of learning sparse structure changes between two Markov networks P and Q. Rather than fitting two Markov networks separately to two sets of data and figuring out their differences, a recent work proposed to learn changes directly via estimating the ratio between the two Markov network models. Such a direct approach was demonstrated to perform excellently in experiments, although its theoretical properties remained unexplored. In this paper, we give sufficient conditions for successful change detection with respect to the sample sizes n_p and n_q, the dimension of the data m, and the number of changed edges d.
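A toy illustration of why the direct ratio approach targets exactly the changed edges (assumptions: Gaussian Markov networks with hand-picked precision matrices; this is not the paper's estimator): for Gaussians, log p(x)/q(x) is a quadratic form whose coefficient matrix is the difference of precision matrices, so sparsity of that difference encodes the changed edges directly, without recovering either full network.

```python
import numpy as np

# Precision matrices of two Gaussian Markov networks P and Q;
# nonzero off-diagonal entries correspond to edges.
Theta_p = np.array([[2.0, 0.5, 0.0],
                    [0.5, 2.0, 0.3],
                    [0.0, 0.3, 2.0]])
Theta_q = Theta_p.copy()
Theta_q[0, 1] = Theta_q[1, 0] = 0.0   # edge (0, 1) is removed in Q

# log p(x)/q(x) = -0.5 x^T (Theta_p - Theta_q) x + const, so the ratio
# model only needs the sparse difference Delta, not either full network.
Delta = Theta_p - Theta_q
changed = np.argwhere(np.triu(np.abs(Delta) > 1e-12, k=1))
print(changed)
```

Here only the single changed edge (0, 1) shows up in Delta, while the shared structure of P and Q cancels out of the ratio entirely.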


Group Sparse Priors for Covariance Estimation

Marlin, Benjamin, Schmidt, Mark, Murphy, Kevin

arXiv.org Machine Learning

Recently it has become popular to learn sparse Gaussian graphical models (GGMs) by imposing l1 or group l1,2 penalties on the elements of the precision matrix. This penalized likelihood approach results in a tractable convex optimization problem. In this paper, we reinterpret these results as performing MAP estimation under a novel prior which we call the group l1 and l1,2 positive-definite matrix distributions. This enables us to build a hierarchical model in which the l1 regularization terms vary depending on which group the entries are assigned to, which in turn allows us to learn block-structured sparse GGMs with unknown group assignments. Exact inference in this hierarchical model is intractable, due to the need to compute the normalization constant of these matrix distributions. However, we derive upper bounds on the partition functions, which lets us use fast variational inference (optimizing a lower bound on the joint posterior). We show that on two real-world data sets (motion capture and financial data), our method which infers the block structure outperforms a method that uses a fixed block structure, which in turn outperforms baseline methods that ignore block structure.
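The penalized-likelihood objective that the paper reinterprets as MAP estimation can be written down in a few lines. A minimal sketch, assuming the standard graphical-lasso form with an off-diagonal l1 penalty (the group l1,2 variant would replace the absolute values with per-block l2 norms); the sample data and penalty weight `lam` are illustrative, not from the paper.

```python
import numpy as np

def penalized_loglik(Omega, S, lam):
    """l1-penalized Gaussian log-likelihood of a precision matrix Omega.

    log det(Omega) - tr(S Omega) - lam * sum_{i != j} |Omega_ij|
    Under the paper's view, the penalty term is the log of an l1
    positive-definite matrix prior, so maximizing this is MAP estimation.
    """
    sign, logdet = np.linalg.slogdet(Omega)
    assert sign > 0, "Omega must be positive definite"
    off_diag_l1 = np.abs(Omega).sum() - np.abs(np.diag(Omega)).sum()
    return logdet - np.trace(S @ Omega) - lam * off_diag_l1

# Evaluate the objective at the identity precision for toy data
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
S = np.cov(X, rowvar=False)        # sample covariance
val = penalized_loglik(np.eye(5), S, lam=0.1)
print(val)
```

Because the penalty decomposes over entries (or groups), varying `lam` per group, as the hierarchical model does, changes only the last term of this objective.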